We consider the evolution of a population of fixed size with no selection. The number of generations G to reach the first common ancestor evolves in time. This evolution can be described by a simple Markov process which allows one to calculate several characteristics of the time dependence of G. We also study how G is correlated to the genetic diversity.



I. INTRODUCTION 

One of the simplest questions one can ask about the history of an evolving population is the age of its most recent 
common ancestor (MRCA). As the population evolves, the age of this MRCA as well as the genealogical tree keep 
changing with an endless appearance of new branches and disappearance of old branches. These perpetual changes in 
the genealogy are accompanied by sudden jumps of the age of the MRCA, which correspond to the extinction of one 
of the oldest branches [1]. In the first part of the present paper we try to describe the evolution of this age in one of
the simplest models of an evolving population, the Wright-Fisher model with no selection. 

Analysis of the human genome makes possible the precise comparison of the DNA sequences of individuals in a population. The number of differences between the sequences of a group of individuals is a testimony of the time elapsed since their common ancestors, and one may hope to infer the history of the group from the knowledge of its DNA sequences. The task is however immense, as many factors interfere: selection history, demography, geography, diploidy. In order to attack the problem of estimating the age of the MRCA from the DNA sequences observed at a given generation, a number of models have been studied, in which at most a few of these factors are included. The goal is always to correlate the observed genetic diversity of the population at a given generation to the age of this MRCA. However, it is difficult to characterize a sample of DNA sequences by a single parameter which would measure its genetic diversity. Ideally, one would like a measure of the genetic diversity at a given generation that is as strongly correlated as possible with the age of the MRCA. In practice, one often uses Tajima's estimator [14], which counts the number of different base pairs between pairs of individuals. But the more precise the characterization of the genetic diversity is, the more difficult the calculations are. Here we consider, in the second part of this paper, the simple case of the infinite allele model, where the only information we keep about pairs of individuals is whether they have the same allele or not, and we try to calculate how the distribution of the age of the MRCA is correlated to this information.

The simplest models one can consider consist in defining some stochastic rules which relate each individual (and its genome) to its parent in the previous generation, and the above questions can be formulated as steady state properties of simple non-equilibrium systems: for example, the coalescence process described below can be viewed as a reaction-diffusion process $A + A \to A$. The coalescing trees observed in genealogies also have striking similarities with the ultrametric structures which emerge in the theory of spin glasses and disordered systems [15, 16]. This is why they motivate a growing interest among statistical physicists [17].

We consider here a population of $N$ individuals evolving according to the Wright-Fisher model (see [4] for a general introduction): successive generations do not overlap, at each new generation all the individuals are replaced by $N$ new ones, and each individual has one parent chosen randomly in the previous generation.

Many results are known in the absence of selection, such as the distribution of the age of the MRCA [18, 19] and the stochastic dynamics of the frequency of a gene [20, 21]. In the last part of this introduction, we recall a few known results that we will use in the rest of the paper.

Recently Serva addressed the problem of the temporal dynamics of the age of the MRCA. In section II we show how to describe these dynamics as a simple Markov process which allows one to calculate all the correlations between these MRCA ages at different generations.

Figure 1: Top of the genealogical tree of a large population. When the size $N$ of the population is large, coalescences at the top of the tree occur only among pairs of individuals and the coalescence times $\tau_i$ are independent random variables. One can see that the two ancestors $A_1$ and $A_2$ generate all the population in the present generation.



One can associate to each individual a gene (or a genome). In section III, where we try to correlate the genetic diversity to the age of the MRCA, we will consider the infinite allele case: each mutation creates a new genome, different from all the genomes which had previously appeared in the whole history of the population. At each generation, there is a probability $\theta/N$ of mutation in the transmission of each genome. This means that each new individual inherits the genome of its parent with probability $1 - \theta/N$ and receives a new genome with probability $\theta/N$. On average, there are therefore $\theta$ mutants in the whole population at each generation. The assumptions made in the infinite allele model and their links to phylogenetics are discussed in the literature: it is an approximation which neglects in particular the possibility that two mutations occur on the same base pair.

The results presented in this article are mostly derived in the limit of a large population. It is well known that, for large $N$, all the relevant times in the genealogy (for example the age of the MRCA) scale like $N$. In the rest of this paper, we will therefore count the number $G$ of generations in units of $N$ and define the time by $t = G/N$.

In the remaining part of this introduction we recall some well known properties of the Wright-Fisher model that we will use later. If one considers a finite number $n$ of individuals, the probability that these individuals have only $p$ parents in the previous generation and that they undergo $m$ mutations scales as $1/N^{n-p+m}$: therefore, if one goes back one generation, there is a probability $1 - (n(n-1)/2 + n\theta)/N$ that the $n$ individuals have different parents and that their genomes are identical to those of their parents. Moreover, there is a probability $n\theta/N$ of observing a single mutation among these $n$ individuals and there is a probability $n(n-1)/(2N)$ that two among the $n$ individuals have the same parent. Therefore, when the size $N$ of the population is large and for $n \ll N^{1/2}$, only pairs of branches coalesce along the tree. The time $T_n$ to find the most recent common ancestor of these $n$ individuals can be written as a sum of $n-1$ independent times $\tau_i$:

$$T_n = \tau_2 + \tau_3 + \dots + \tau_n$$

where $\tau_i$ is the time spent between the $i$-th and the $(i-1)$-th coalescence on the tree, i.e. the time during which there are $i$ lineages. This allows one to calculate the distributions $p_i(\tau_i)$, as shown in the appendix:

$$p_i(\tau_i) = c_i\, e^{-c_i \tau_i} \tag{1}$$

where the coefficients $c_i$ are defined by:

$$c_i = \frac{i(i-1)}{2} \tag{2}$$

The generating function of the coalescence time $T_n$ is therefore:

$$\langle e^{-\lambda T_n} \rangle = \prod_{i=2}^{n} \frac{c_i}{c_i + \lambda} \tag{3}$$

From (3), one can get the average and the variance of $T_n$:

$$\langle T_n \rangle = 2\left(1 - \frac{1}{n}\right) \quad\text{and}\quad \langle T_n^2 \rangle - \langle T_n \rangle^2 = \sum_{i=2}^{n} \frac{1}{c_i^2} = 8 \sum_{j=1}^{n} \frac{1}{j^2} - 12 + \frac{8}{n} - \frac{4}{n^2} \tag{4}$$
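
As a quick consistency check, the mean and variance in (4) can be recomputed directly from the rates $c_i$ of (2), since $T_n$ is a sum of independent exponential times. The sketch below uses exact rational arithmetic; the function names are illustrative, and the closed form tested for the variance is $8\sum_{j=1}^{n} 1/j^2 - 12 + 8/n - 4/n^2$, which is one way of writing $\sum_{i=2}^{n} 1/c_i^2$.

```python
from fractions import Fraction


def c(i):
    """Pair-coalescence rate c_i = i(i-1)/2 of equation (2)."""
    return Fraction(i * (i - 1), 2)


def mean_and_variance(n):
    """Mean and variance of T_n = tau_2 + ... + tau_n, with tau_i exponential of rate c_i."""
    mean = sum(1 / c(i) for i in range(2, n + 1))
    variance = sum(1 / c(i) ** 2 for i in range(2, n + 1))
    return mean, variance


def closed_form(n):
    """Closed-form expressions for the same two quantities."""
    mean = 2 * (1 - Fraction(1, n))
    harmonic2 = sum(Fraction(1, j * j) for j in range(1, n + 1))
    variance = 8 * harmonic2 - 12 + Fraction(8, n) - Fraction(4, n * n)
    return mean, variance
```

For large $n$ the variance returned by `mean_and_variance` approaches $4\pi^2/3 - 12 \simeq 1.159$, so the distribution of $T_n$ does not become narrow.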

One can notice that the distribution of $T_n$ remains broad even for large $n$. Although the expressions are derived for fixed $n \ll N$ and in the limit $N \to \infty$, the limit $n \to \infty$ in (3) and (4) coincides with what would be obtained by setting $n = N$, i.e. by considering the time $T$ to find the MRCA of the whole population:

$$\langle e^{-\lambda T} \rangle = \prod_{i=2}^{\infty} \frac{c_i}{c_i + \lambda} \tag{5}$$

It leads to the following expressions of the first two moments of the coalescence time:

$$\langle T \rangle = 2 \quad\text{and}\quad \langle T^2 \rangle - \langle T \rangle^2 = \frac{4\pi^2}{3} - 12 \simeq 1.159\ldots$$

and to the following stationary distribution $p_{st}(T)$:

$$p_{st}(T) = \sum_{p=2}^{\infty} (-1)^p (2p-1)\, c_p\, e^{-c_p T} \tag{6}$$

On the other hand, the stationary distribution of the genomes is known: the probability that, among $n$ individuals, the first $n_1$ have the same genome, the next $n_2$ another genome, and so on until the last $n_k$ which have the $k$-th genome, is given by Ewens' sampling formula:



P gr oup S (m, ...,n k ) = -iM- (2g) fc - (7) 

T(n + 20) nin 2 ...nk 



where $\Gamma(x)$ is the Euler $\Gamma$ function and $\theta$ the mutation rate.

There are several approaches to calculate the statistical properties of the above model: either one can write recursive equations between successive generations and try to solve them, or one can count directly all the possible histories of coalescences and mutations of a group. The first approach leads to a hierarchy of equations, whereas the second reduces to a simple enumeration. We use both in the present paper, choosing in each case the one that appeared simpler to implement. A coalescence history, as described in the appendix, consists of a tree structure, in which each step corresponds to a coalescence of two individuals chosen randomly among the $n' \le n$ which remain, and of a set of $n-1$ times $\tau_i$ between two successive coalescences. A very important simplification (shown in the appendix), which we will use over and over, is that the shape (i.e. the topology) of the trees and the times $\tau_i$ are independent random variables.



II. STATISTICS OF THE DISCONTINUITIES OF THE COALESCENCE TIME OF THE POPULATION 



A. Numerical Simulations 



The Wright-Fisher model implemented for a population of $N = 500$ individuals shows interesting features for the evolution of the coalescence time $T$ (see figure 2 for $G = 5000$ generations, corresponding to a normalized duration of $\Delta t = 10$). The evolution shows periods of linear increase, separated by discontinuous drops. Let us call $D_k$ the duration of the $k$-th linear increase and $H_k$ the height of the drop following it. The distributions of the $D_k$'s and $H_k$'s, measured over 9169 discontinuities, are shown in figure 3. Similar results were previously reported in [1].
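
A minimal sketch of such a simulation is given below. It is not the code used for the figures: the population size, burn-in length, seed and function names are illustrative assumptions, and the MRCA is found by brute-force backtracking through the recorded parent tables, which is only practical for small $N$.

```python
import random


def simulate_parents(n_individuals, n_generations, rng):
    """parents[g][i] is the index, in generation g-1, of the parent of individual i of generation g."""
    return [[rng.randrange(n_individuals) for _ in range(n_individuals)]
            for _ in range(n_generations + 1)]


def mrca_age(parents, g):
    """Number of generations from generation g back to its MRCA (None if not reached)."""
    ancestors = set(range(len(parents[g])))
    age = 0
    while len(ancestors) > 1:
        if g - age == 0:
            return None  # the recorded history is too short
        ancestors = {parents[g - age][i] for i in ancestors}
        age += 1
    return age


def mrca_trajectory(n_individuals=20, burn_in=400, n_generations=300, seed=0):
    """Age G of the MRCA for each generation after a burn-in period."""
    rng = random.Random(seed)
    parents = simulate_parents(n_individuals, burn_in + n_generations, rng)
    return [mrca_age(parents, g) for g in range(burn_in, burn_in + n_generations + 1)]
```

Between two consecutive generations the age either grows by exactly one generation or drops, which gives the sawtooth shape described above; dividing the ages by $N$ gives the rescaled time $T = G/N$.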

The data of figure 3 indicate that the delays $D_k$ and the heights $H_k$ have an exponential distribution of average 1. The correlations can also be measured (error bars of order 0.01):



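$$\langle D_k D_{k+1} \rangle - \langle D_k \rangle \langle D_{k+1} \rangle \simeq -0.005 \tag{8a}$$

$$\langle H_k H_{k+1} \rangle - \langle H_k \rangle \langle H_{k+1} \rangle \simeq -0.006 \tag{8b}$$

$$\langle H_{k-1} D_k \rangle - \langle H_{k-1} \rangle \langle D_k \rangle \simeq -0.002 \tag{8c}$$

$$\langle H_k D_k \rangle - \langle H_k \rangle \langle D_k \rangle \simeq 0.84 \tag{8d}$$

$$\langle H_k D_{k-1} \rangle - \langle H_k \rangle \langle D_{k-1} \rangle \simeq 0.12 \tag{8e}$$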



This indicates that the only correlation seems to be between the H k and the previous D k . We try to understand these 
correlations below. 

Figure 2: Evolution of the age $T = G/N$ of the MRCA for a population of $N = 500$ individuals in the Wright-Fisher model over a rescaled duration $\Delta t = 10$, i.e. over $N \Delta t = 10N$ generations (dashed line). Thick line: average $T_2$ over the whole population of the coalescence time of two individuals; thin line: average $T_3$ over the whole population of the coalescence time of three individuals. One can see that discontinuities are anticipated by the decreases of the average coalescence time of two or three individuals.



B. Distribution of delays between two discontinuities 

When $N$ is large, simultaneous coalescences between groups of three or more individuals are negligible (of order $1/N^2$) at the top of the tree (i.e. for the last $n$ coalescences with $n \ll \sqrt{N}$, only coalescences of pairs occur). Thus, as shown in figure 1, all the population in the present generation is generated by the two individuals $A_1$ and $A_2$ reached at the penultimate coalescence, and it can be divided into two groups according to these two ancestors. A discontinuity appears in the age of the MRCA when one of the two groups generated by $A_1$ and $A_2$ has no offspring. The dynamics of the sizes $N_i$ of these groups was studied by Serva [1], who showed numerically that the delays $D_i$ have an exponential distribution:

$$P_{delay}(D) = e^{-D} \tag{9}$$

consistent with the results of figure 3.

In order to derive (9), let us introduce the probability $P_{same}(t_0, t)$ that the MRCA of a population is the same at time $t_0$ and at time $t$ (with $t > t_0$), as in figure 4.

As explained above, the population at time $t_0$ can be divided into two groups of sizes $N_1$ and $N_2 = N - N_1$ according to the ancestors $A_1$ and $A_2$ from which they come, and one can define the densities $x = N_1/N$ and $1 - x = N_2/N$. At a given generation, $x$ is a random variable in $[0, 1]$. Its stochastic evolution is given by the Wright-Fisher rule, which is analogous to a Brownian motion, and its stationary distribution $p(x)$ is uniform on $[0, 1]$ for $x$ of order 1 (see [1] or appendix A for a short derivation). There are finite size corrections to this uniform distribution near the boundaries, for $x = O(1/N)$ and $1 - x = O(1/N)$; we will not discuss them here as they have no incidence on what follows.

The MRCA of the population at time $t$ is the same as the one of the population at time $t_0$ if and only if the ancestors $A_1$ and $A_2$ both still have descendants in the population at time $t$. If $m$ is the number of ancestors at time $t_0$ of the population at time $t$, this means that some of these $m$ ancestors should be present in both groups of sizes $N_1$ and $N_2$ coming from $A_1$ and $A_2$ (see figure 4). As the probabilities for each of the $m$ ancestors to belong to the first or the second group are $x$ and $1 - x$, the probability that both groups contain at least one of these $m$ ancestors is $1 - (1-x)^m - x^m$.

If one introduces the probability $z_m(t - t_0)$ that the population at time $t$ has $m$ ancestors in the population at time $t_0 < t$, the probability $P_{same}(t_0, t)$ is given by:

$$P_{same}(t_0, t) = \int_0^1 dx \sum_{m=1}^{\infty} z_m(t - t_0) \left( 1 - (1-x)^m - x^m \right) \tag{10}$$

Figure 3: Measured distributions of the discontinuities of the age $T$ of the MRCA for a sample of 9169 discontinuities when the population size is $N = 500$ individuals. Top: histogram of the distribution of the delays $D_k$ between two successive discontinuities; the dashed line is the exponential distribution (9). Bottom: histogram of the distribution of the jumps $H_k$ at the discontinuities of $T$; the dashed line is the exponential distribution (31).

Figure 4: Structure of the genealogical tree of the population when the MRCA is the same at $t_0$ and $t$. The population at $t$ must have ancestors at $t_0$ in each of the two groups generated by $A_1$ and $A_2$.

The functions $z_m(t)$ are known [16]. They satisfy recursive equations: the probability that the number of ancestors at $t_0 < t$ of a population at $t$ is $m$ is the sum of the probability that this number is $m$ at time $t_0 + dt$ with no coalescence among these $m$ during $dt$ and of the probability that there are $m+1$ ancestors at $t_0 + dt$ with a coalescence between $t_0$ and $t_0 + dt$. Therefore the functions $z_m$ satisfy:

$$\frac{d}{d\tau} z_m(\tau) = c_{m+1}\, z_{m+1}(\tau) - c_m\, z_m(\tau) \tag{11}$$



The function $z_1(\tau)$ is known, as it is related to the distribution (6) of the age $T$ of the MRCA:

$$\frac{d}{d\tau} z_1(\tau) = p_{st}(\tau) \quad\text{and}\quad z_1(0) = 0$$



The solution of (11) is [16]:

$$z_m(\tau) = \sum_{p=m}^{\infty} (-1)^{p-m}\, \frac{(2p-1)\,(m+p-2)!}{m!\,(m-1)!\,(p-m)!}\; e^{-c_p \tau} \tag{12}$$
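
The series (12) is easy to evaluate numerically as long as $\tau$ is not too small, because the factors $e^{-c_p \tau}$ decay very quickly with $p$. The sketch below is only an illustration: the truncation `p_max` and the function name are assumptions, and for small $\tau$ the alternating terms become large and the truncated sum is unreliable.

```python
from math import exp, factorial


def z(m, tau, p_max=60):
    """Truncated evaluation of z_m(tau) from equation (12), with c_p = p(p-1)/2."""
    total = 0.0
    for p in range(m, p_max + 1):
        coeff = (2 * p - 1) * factorial(m + p - 2) / (
            factorial(m) * factorial(m - 1) * factorial(p - m))
        total += (-1) ** (p - m) * coeff * exp(-p * (p - 1) / 2 * tau)
    return total
```

Summing `z(m, tau)` over `m` from 1 to `p_max` recovers the normalization $\sum_m z_m(\tau) = 1$ up to rounding errors, and `z(1, tau)` tends to 1 for large `tau`, when the whole population has a single ancestor.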



Using the normalization $\sum_{m=1}^{\infty} z_m(\tau) = 1$ and the fact that $x$ is uniformly distributed between 0 and 1, one gets:

$$P_{same}(t_0, t) = 1 - z_1(t - t_0) - \sum_{m=2}^{\infty} \frac{2}{m+1}\, z_m(t - t_0) = \sum_{p=2}^{\infty} (-1)^p (2p-1)\, e^{-c_p (t - t_0)} \left[ 1 - 2 \sum_{m=2}^{p} \frac{(-1)^m (m+p-2)!}{(m+1)!\,(m-1)!\,(p-m)!} \right] \tag{13}$$



Using the identity:

$$\sum_{m=2}^{p} \frac{(-1)^m (m+p-2)!}{(m+1)!\,(m-1)!\,(p-m)!} = \begin{cases} 1/3 & \text{if } p = 2 \\ 1/2 & \text{if } p \ge 3 \end{cases}$$
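
The identity can be checked with exact rational arithmetic for the first values of $p$. The sketch below (function names are illustrative) also evaluates the weight multiplying $e^{-c_p (t - t_0)}$ in (13), namely $(-1)^p (2p-1)$ times one minus twice the sum.

```python
from fractions import Fraction
from math import factorial


def inner_sum(p):
    """Sum over m = 2..p of (-1)^m (m+p-2)! / ((m+1)! (m-1)! (p-m)!)."""
    return sum(
        Fraction((-1) ** m * factorial(m + p - 2),
                 factorial(m + 1) * factorial(m - 1) * factorial(p - m))
        for m in range(2, p + 1))


def coefficient(p):
    """Weight of exp(-c_p (t - t0)) in equation (13)."""
    return (-1) ** p * (2 * p - 1) * (1 - 2 * inner_sum(p))
```

With these definitions `coefficient(2)` equals 1 and `coefficient(p)` is zero for every larger `p` tested, so only the exponential with $c_2 = 1$ survives.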



one can see that all the exponentials in (13) vanish except the one for $p = 2$, and one obtains:

$$P_{same}(t_0, t) = e^{-(t - t_0)} \tag{14}$$

This shows that the delays $D_k$ between two successive jumps are distributed according to:

$$P_{delay}(D) = -\frac{d P_{same}}{dt}(0, t = D) = e^{-D}$$



C. The coalescence times $\tau_i$ as a Markov process

Figure 5 shows the stochastic dynamics of the coalescence time $\tau_2$. Actually, all the elementary times of figure 1 have similar dynamics. The coalescence times $\tau_i$ are the waiting times between two successive coalescences in a genealogy (see figure 1) and evolve when extinctions of lineages occur. For example, if the lineage of $A_2$ in figure 1 gets extinct, then the new MRCA is $A_1$ and the new time $\tau'_2$ is the former $\tau_3$. This change implies a global shift $\tau'_i = \tau_{i+1}$ for $i \ge 2$. On the other hand, if one of the lineages at the next level of the tree (on the left in figure 1) gets extinct, the MRCA does not change but the $\tau_i$ become $\tau'_2 = \tau_2 + \tau_3$ and $\tau'_i = \tau_{i+1}$ for $i \ge 3$.

More generally, one can consider the top of the genealogical tree of a population between the dates when the number of ancestors is 1 and $n$. In this part of the tree, there are $n-1$ coalescence times $\tau_2, \dots, \tau_n$. The $n$ leaves of the tree generate all the population in the present generation. The dynamics of the $\tau_i$ is controlled by the extinctions of the $n$ lineages coming from these $n$ ancestors: whenever one of them gets extinct, some of the times $\tau_i$ topple.

Actually, the observed dynamics of the $\tau_i$ can be described by the large $n$ limit of the following stochastic process: either no extinction occurs and the times $\tau_i$ remain unchanged:

$$\tau_i(t + dt) = \tau_i(t) \quad\text{with probability}\quad 1 - \frac{n(n-1)}{2}\, dt \tag{15}$$

Figure 5: Evolution of the delay $\tau_2$ between the two oldest coalescences at the top of the genealogical tree of a population of 100 individuals at time $t$. The dashed line corresponds to the age $T$ of the MRCA: its shape is similar to figure 2. The study of the dynamics of $\tau_2$ shows that, at random times depending on extinctions, the time $\tau_2$ either increases by $\tau_3$ or is reset to $\tau_3$, so that the new $\tau'_2$ is given either by $\tau_2 + \tau_3$ or by $\tau_3$.




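or an extinction occurs and, with probability $p_i\, dt$ for $2 \le i \le n-1$, the times topple at rank $i$:

$$\begin{aligned}
\tau_j(t + dt) &= \tau_j(t) && \text{for } j < i \\
\tau_i(t + dt) &= \tau_i(t) + \tau_{i+1}(t) \\
\tau_j(t + dt) &= \tau_{j+1}(t) && \text{for } i+1 \le j \le n-1 \\
\tau_n(t + dt) &= \epsilon_n(t)
\end{aligned} \tag{16}$$

Moreover, with probability $p_1\, dt$, for $i = 1$, all the times $\tau_j$ are shifted:

$$\begin{aligned}
\tau_j(t + dt) &= \tau_{j+1}(t) && \text{for } 2 \le j \le n-1 \\
\tau_n(t + dt) &= \epsilon_n(t)
\end{aligned} \tag{17}$$

In the appendix we show that the toppling rates $p_i$ are given by:

$$p_i = i \tag{18}$$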

To determine the dynamics of $\tau_n$, we need to specify $\epsilon_n(t)$: the $\epsilon_n(t)$ are random numbers uncorrelated in time which must have the same average as $\tau_n$: $\langle \epsilon_n \rangle = \langle \tau_n \rangle = 2/(n(n-1))$. We will see however that, when $n$ is large, the precise form of the distribution of the $\epsilon_n$ plays no role as long as $\langle \epsilon_n \rangle = \langle \tau_n \rangle$. This feature can be understood because $c_n = n(n-1)/2$ goes to infinity when $n$ becomes large; therefore the larger $n$ is, the more often the time $\tau_n$ is reset and a new $\epsilon_n$ enters the system; in any time interval, the time $\tau_n$ is reset so many times, with many independent $\epsilon_n$ entering the system, that only $\langle \epsilon_n \rangle$ matters because of the law of large numbers.
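
The toppling process (15)-(18) is simple to simulate with a standard continuous-time (Gillespie-type) scheme. The sketch below is only an illustration under stated assumptions: the cutoff $n = 12$, the seed and the function names are arbitrary, and $\epsilon_n$ is taken exponential with mean $1/c_n$, which is one possible choice among the distributions with the required average.

```python
import random


def simulate_toppling(n=12, n_discontinuities=5000, seed=1):
    """Simulate the toppling dynamics and record the delays D_k and drops H_k."""
    rng = random.Random(seed)
    c_n = n * (n - 1) / 2
    # tau[k] stores tau_{k+2}; start from the stationary exponentials of equation (1)
    tau = [rng.expovariate(i * (i - 1) / 2) for i in range(2, n + 1)]
    ranks = list(range(1, n))          # toppling ranks i = 1, ..., n-1 with rates p_i = i
    cum_weights, running = [], 0
    for i in ranks:
        running += i
        cum_weights.append(running)
    delays, drops = [], []
    since_last = 0.0
    while len(delays) < n_discontinuities:
        since_last += rng.expovariate(c_n)      # total event rate is c_n
        i = rng.choices(ranks, cum_weights=cum_weights)[0]
        if i == 1:                              # equation (17): tau_2 leaves the system
            delays.append(since_last)
            drops.append(tau[0])
            since_last = 0.0
            tau.pop(0)
        else:                                   # equation (16): tau_i absorbs tau_{i+1}
            k = i - 2
            tau[k] += tau[k + 1]
            del tau[k + 1]
        tau.append(rng.expovariate(c_n))        # assumed exponential epsilon_n, mean 1/c_n
    return delays, drops


def covariance(xs, ys):
    """Empirical covariance <xy> - <x><y>."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum(x * y for x, y in zip(xs, ys)) / len(xs) - mx * my
```

Calling `delays, drops = simulate_toppling()` and then `covariance(delays, drops)` gives an estimate of the correlation between a drop and the delay preceding it, while the averages of both lists should be close to 1.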

The value of $\langle \epsilon_n \rangle$ can also be understood through the stationarity conditions: $\epsilon_n$ is added to the system with a rate $c_n$ whereas $\tau_2$ is removed with a rate 1. The system can reach a stationary state only if $\langle \epsilon_n \rangle c_n = \langle \tau_2 \rangle = 1$. Another consequence is that the total coalescence time $T(t) = \sum_{i=2}^{n} \tau_i$ increases on average by $c_n \langle \epsilon_n \rangle \Delta t = \Delta t$ during $\Delta t$ when no discontinuity occurs, in agreement with the slope 1 observed in figures 2 and 5.

These simple dynamics of the times $\tau_i$ allow one to determine all the statistical properties of $T(t)$: its correlations at different times, the distribution of its discontinuities $H_k$ and the distribution of the coalescence times $T$ right before a discontinuity.

First, it is obvious that the distribution of delays between successive discontinuities of $T$ is exponential. The toppling dynamics (16) also imply that, at a given time $t$, all the $\tau_i(t)$ are sums of times $\tau_j(0)$ with $j \ge i$ and of $\epsilon_n$'s. These sums do not overlap and thus, if the initial times $\tau_j(0)$ are not correlated at $t = 0$, the $\tau_i(t)$ remain uncorrelated at any later time. However, any $\tau_i(t)$ depends only on the initial times $\tau_j(0)$ with $i \le j$, so that the only nonzero correlations in this system are the $G_{i,j}(t)$ with $i \le j$, defined as:

$$G_{i,j}(t) = \langle \tau_i(t)\, \tau_j(0) \rangle - \langle \tau_i(t) \rangle \langle \tau_j(0) \rangle \tag{19}$$

A consequence of (16) and (17) is that:

$$\tau_i(t + dt) = \begin{cases} \tau_i(t) & \text{with probability } 1 - c_{i+1}\, dt \\ \tau_i(t) + \tau_{i+1}(t) & \text{with probability } i\, dt \\ \tau_{i+1}(t) & \text{with probability } c_i\, dt \end{cases} \tag{20}$$

Therefore, $G_{i,j}$ satisfies the following differential equation:

$$\frac{d}{dt} G_{i,j}(t) = -c_i\, G_{i,j}(t) + c_{i+1}\, G_{i+1,j}(t) \tag{21}$$

The initial conditions correspond to times $\tau_i$ generated according to the stationary distribution (1), and thus one has $G_{i,j}(0) = \delta_{ij}/c_i^2$. We have seen that $G_{i,j}(t) = 0$ if $i \ge j+1$, and this immediately gives the solution of (21) for $i = j$:

$$G_{i,i}(t) = \frac{1}{c_i^2}\, e^{-c_i t} \tag{22}$$

More generally, for $i \le j$, the Laplace transform $\tilde{G}_{i,j}(\lambda) = \int_0^\infty e^{-\lambda t}\, G_{i,j}(t)\, dt$ is given by the following product, obtained by iterating (21) from $i = j$ down to $i$:

$$\tilde{G}_{i,j}(\lambda) = \frac{1}{c_j^2\, (\lambda + c_j)} \prod_{k=i}^{j-1} \frac{c_{k+1}}{\lambda + c_k} \tag{23}$$
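
In particular, the correlation function of the total coalescence time $T(t) = \tau_2(t) + \tau_3(t) + \dots$ can be written as:

$$\langle T(t) T(0) \rangle - \langle T(t) \rangle \langle T(0) \rangle = \sum_{i,j} \Big( \langle \tau_i(t)\, \tau_j(0) \rangle - \langle \tau_i(t) \rangle \langle \tau_j(0) \rangle \Big) = \sum_{i=2}^{\infty} \sum_{j=i}^{\infty} G_{i,j}(t) \tag{24}$$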

In principle (23) and (24) allow one to extract the explicit expression of the autocorrelation function of $T$. We will describe later an alternative method to determine this explicit expression.

The dynamics (16) also gives the statistical properties of the $\tau_i$ at the time of a discontinuity. In particular, we are going to show that the distribution of the total coalescence time $T(t)$ right before a discontinuity is equal to the stationary distribution (5).

First, we remark from (16) that a discontinuity occurs when $\tau_2$ is thrown out of the system. More precisely, the height $H_k$ is equal to this $\tau_2$ just before the discontinuity, and the distribution of each $\tau_i$ just after the jump is the distribution of $\tau_{i+1}$ before the jump. Moreover, if the process is started just after a discontinuity (i.e. we choose a discontinuity of $T$ as the origin of time), one can introduce a variable $\eta(t)$ defined as:

$$\eta(t) = \begin{cases} 1 & \text{before the next discontinuity of } T \\ 0 & \text{after the next discontinuity of } T \end{cases} \tag{25}$$

The dynamics (16) implies that the average $\langle \eta \rangle$ decays exponentially as $\langle \eta(t) \rangle = e^{-t}$. The introduction of $\eta$ allows us to study what happens between discontinuities. In particular, the generating function $G^{(-)}(\lambda) = \langle e^{-\lambda T} \rangle_{before}$ of the coalescence time $T(t)$ right before a discontinuity takes the form:

$$G^{(-)}(\lambda) = \langle e^{-\lambda T} \rangle_{before} = \int_0^\infty \langle \eta(t)\, e^{-\lambda T(t)} \rangle\, dt \tag{26}$$



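From (16), the correlation function $\langle \eta(t)\, e^{-\lambda T(t)} \rangle$ satisfies:

$$\frac{d}{dt} \langle \eta(t)\, e^{-\lambda T(t)} \rangle = -c_n \langle \eta(t)\, e^{-\lambda T(t)} \rangle + (c_n - 1) \langle \eta(t)\, e^{-\lambda T(t)} \rangle \langle e^{-\lambda \epsilon_n} \rangle$$

Integrating over $t$, one gets for $G^{(-)}(\lambda)$ from (26):

$$G^{(-)}(\lambda) = \frac{G^{(+)}(\lambda)}{c_n - (c_n - 1) \langle e^{-\lambda \epsilon_n} \rangle} \tag{27}$$

where $G^{(+)}(\lambda) = \langle e^{-\lambda T} \rangle_{after}$ is the generating function of the total coalescence time right after the discontinuity.

On the other hand, the stationary distribution can also be written in terms of $G^{(+)}$. The generating function $\langle e^{-\lambda T(t)} \rangle$ of $T(t)$ satisfies:

$$\langle e^{-\lambda T(t + dt)} \rangle = (1 - c_n\, dt) \langle e^{-\lambda T(t)} \rangle + dt\, \langle e^{-\lambda T(t)} \rangle_{after} + (c_n - 1)\, dt\, \langle e^{-\lambda T(t)} \rangle \langle e^{-\lambda \epsilon_n} \rangle \tag{28}$$

Thus, the stationary generating function is given by:

$$G_{st}(\lambda) = \langle e^{-\lambda T(t)} \rangle_{st} = \frac{G^{(+)}(\lambda)}{c_n - (c_n - 1) \langle e^{-\lambda \epsilon_n} \rangle} \tag{29}$$

Comparing (27) and (29), we see that:

$$G^{(-)}(\lambda) = \langle e^{-\lambda T} \rangle_{before} = \langle e^{-\lambda T(t)} \rangle_{st} = G_{st}(\lambda) \tag{30}$$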

This result, which we checked in our simulations, looks paradoxical: although $T(t)$ reaches a local maximum when the MRCA changes, the distribution of $T$ at these local maxima is the same as the distribution of $T(t)$ over the whole range of time. In fact, one can show by similar calculations that the same is true for all the $\tau_i$'s: their distributions right before a discontinuity of $T$ are the same as the stationary ones. The case of $\tau_2$ explains the properties of the drops $H_k$ at the discontinuities of $T$, since the value of $H_k$ is the value of $\tau_2$ just before the discontinuity. Their distribution is exponential:

$$P_{height}(H) = e^{-H} \tag{31}$$

which is in agreement with the data of figure 3. Moreover, the $H_k$ are not correlated, in agreement with (8b): if $H_k = \tau_2$, then $H_{k+1}$ is made of some $\tau_j$'s with $j \ge 3$ at the time of the previous discontinuity.

One also sees from (16) that, just after the discontinuity, $\tau_i$ is replaced by the $\tau_{i+1}$ just before the discontinuity, which was distributed according to the stationary distribution (1). Thus the distribution of $T$ just after the discontinuity should be given by a formula similar to (5) starting only at $i = 3$. The comparison with (29) implies that the factor $1/(c_n - (c_n - 1) \langle e^{-\lambda \epsilon_n} \rangle)$ should become $1/(\lambda + 1)$ in the large $n$ limit. This is easily checked, as $\epsilon_n \sim 1/n^2$ and for large $n$:

$$\langle e^{-\lambda \epsilon_n} \rangle = 1 - \lambda \langle \epsilon_n \rangle + o(1/n^2)$$

This shows in particular that, for large $n$, only the average $\langle \epsilon_n \rangle$ matters.
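
The large $n$ limit of this factor can be illustrated numerically for two different choices of the distribution of $\epsilon_n$ with the same mean $1/c_n$. Both choices (an exponential and a deterministic $\epsilon_n$) and the function names are assumptions made for this sketch only.

```python
from math import exp


def c(n):
    """c_n = n(n-1)/2."""
    return n * (n - 1) / 2


def laplace_exponential(lam, cn):
    """<exp(-lam * eps)> for eps exponential with mean 1/c_n."""
    return cn / (cn + lam)


def laplace_deterministic(lam, cn):
    """<exp(-lam * eps)> for eps fixed to 1/c_n."""
    return exp(-lam / cn)


def factor(lam, n, laplace_eps):
    """The factor 1 / (c_n - (c_n - 1) <exp(-lam * eps_n)>) appearing in (27) and (29)."""
    cn = c(n)
    return 1.0 / (cn - (cn - 1.0) * laplace_eps(lam, cn))
```

For example `factor(0.7, 1000, laplace_exponential)` and `factor(0.7, 1000, laplace_deterministic)` both agree with `1 / 1.7` to several digits, which illustrates that only the mean of $\epsilon_n$ enters in the limit.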

The analytical value of (8d) can also be obtained using the toppling dynamics of the $\tau_i$. Using the variable $\eta(t)$ defined in (25), the delay $D_k$ is the time at which $\eta(t)$ goes to zero and the height $H_k$ is the time $\tau_2$ right before the drop. The correlation coefficient is given by:

$$\langle D_k H_k \rangle = \int_0^\infty t\, \langle \eta(t)\, \tau_2(t) \rangle\, dt$$

This suggests considering the functions $\psi_i(\lambda) = \int_0^\infty e^{-\lambda t} \langle \eta(t)\, \tau_i(t) \rangle\, dt$, as the correlation coefficient is $\langle D H \rangle = -d\psi_2/d\lambda$ at $\lambda = 0$. The coefficients $\langle \eta(t)\, \tau_i(t) \rangle$ satisfy the following differential equation derived from (16) and (20):

$$\frac{d}{dt} \langle \eta(t)\, \tau_i(t) \rangle = -c_i \langle \eta(t)\, \tau_i(t) \rangle + (c_{i+1} - 1) \langle \eta(t)\, \tau_{i+1}(t) \rangle \tag{32}$$

At $t = 0$, $\eta(t)$ is equal to 1 and $\tau_i$ is distributed according to $p_{i+1}$ given by (1), since we saw in (17) that there is a global shift of the $\tau_i$'s at each discontinuity of $T$. This implies that the functions $\psi_i$ satisfy the following recursion:

$$\lambda\, \psi_i(\lambda) - \frac{1}{c_{i+1}} = -c_i\, \psi_i(\lambda) + (c_{i+1} - 1)\, \psi_{i+1}(\lambda) \tag{33}$$

As we need the first derivative of $\psi_2$ at zero, we can expand $\psi_i$ in powers of $\lambda$:

$$\psi_i(\lambda) = \frac{1}{c_{i+1}} \left( u_i - \lambda v_i + O(\lambda^2) \right) \tag{34}$$
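
The coefficients $u_i$ and $v_i$ satisfy the following simple recursion derived from (33):

$$u_i = \frac{2}{i(i-1)} + u_{i+1} \tag{35a}$$

$$v_i = \frac{2 u_i}{i(i-1)} + v_{i+1} \tag{35b}$$

The term $\psi_{n+1}$ is linked to the boundary condition $\epsilon_n$ and one has $\psi_{n+1}(\lambda) = \langle \epsilon_n \rangle/(1 + \lambda)$, so that $u_{n+1} = \langle \epsilon_n \rangle (n+1)(n+2)/2$ and $v_{n+1} = \langle \epsilon_n \rangle (n+1)(n+2)/2$ (which are not negligible when $n \to \infty$). Equations (35a) and (35b) give simple summation formulas for $u_i$ and $v_i$:

$$u_i = \sum_{j=i}^{n} \frac{2}{j(j-1)} + \langle \epsilon_n \rangle \frac{(n+1)(n+2)}{2} \;\xrightarrow[n \to \infty]{}\; \frac{2}{i-1} + 1 \tag{36a}$$

$$v_i = \sum_{j=i}^{n} \frac{2 u_j}{j(j-1)} + \langle \epsilon_n \rangle \frac{(n+1)(n+2)}{2} \;\xrightarrow[n \to \infty]{}\; \frac{2\pi^2}{3} + 1 - \frac{2}{i-1} - 4 \sum_{k=1}^{i-2} \frac{1}{k^2} \tag{36b}$$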


Finally, the expansion of $\psi_2$ around $\lambda = 0$ gives the following correlation coefficient, in good agreement with the measured value (8d):

$$\langle D_k H_k \rangle - \langle D_k \rangle \langle H_k \rangle = \frac{2\pi^2}{9} - \frac{4}{3} \simeq 0.8599\ldots \tag{37}$$
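
The value (37) can be recovered by iterating the recursions (35a) and (35b) downwards from the boundary values $u_{n+1} = v_{n+1} = \langle \epsilon_n \rangle (n+1)(n+2)/2$ with $\langle \epsilon_n \rangle = 2/(n(n-1))$. The sketch below assumes that the prefactor in the expansion (34) is $1/c_{i+1}$, so that $\langle D H \rangle = v_2/c_3$ with $c_3 = 3$; the function name and the cutoff are illustrative.

```python
from math import pi


def dh_covariance(n):
    """<D H> - <D><H> from the recursions (35a)-(35b) with cutoff n."""
    eps_mean = 2.0 / (n * (n - 1))
    u = v = eps_mean * (n + 1) * (n + 2) / 2.0   # boundary values u_{n+1} = v_{n+1}
    for i in range(n, 1, -1):
        u = 2.0 / (i * (i - 1)) + u               # equation (35a): u_i from u_{i+1}
        v = 2.0 * u / (i * (i - 1)) + v           # equation (35b): uses the new u_i
    c3 = 3.0
    return v / c3 - 1.0                           # <D H> = v_2 / c_3 and <D><H> = 1


def exact_limit():
    """Large n limit 2 pi^2 / 9 - 4/3 of equation (37)."""
    return 2 * pi ** 2 / 9 - 4.0 / 3.0
```

The finite-$n$ value converges to the limit with a correction of order $1/n$, so a cutoff of a few thousand already reproduces the first digits of 0.8599.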



D. Correlation functions of the coalescence times between few individuals 

Consider a pair of individuals $(i, j)$ at generation $t$. One can define the time $T^{(i,j)}(t)$ to find their first common ancestor (i.e. $N T^{(i,j)}(t)$ is the number of generations to reach their first common ancestor). Similarly, one may consider three individuals $(i, j, k)$ at generation $t$ and define the time $T^{(i,j,k)}(t)$ to find their first common ancestor. One can average these times over the whole population:

$$T_2(t) = \frac{1}{N^2} \sum_{i,j} T^{(i,j)}(t) \tag{39}$$

$$T_3(t) = \frac{1}{N^3} \sum_{i,j,k} T^{(i,j,k)}(t) \tag{40}$$

Figure 2 shows the stochastic evolution of these averages $T_2(t)$ and $T_3(t)$. We are now going to determine the correlation functions of these times (in order to avoid confusion, we will use the lower case letter $t$ for the usual time, oriented towards the future, and the upper case letter $T$ for ages, i.e. oriented towards the past).

To understand the correlations of $T_2(t)$ and $T_3(t)$, let us look at two individuals $i$ and $j$ at generation $t$ and two individuals $k$ and $l$ at generation 0. Their coalescence times are defined as $T^{(i,j)}(t)$ and $T^{(k,l)}(0)$. There are two possibilities:

• either $T^{(i,j)}(t)$ is smaller than $t$ and the coalescence times $T^{(i,j)}(t)$ and $T^{(k,l)}(0)$ are independent,

• or $T^{(i,j)}(t)$ is larger than $t$ and the entanglement between lineages creates a correlation between $T^{(k,l)}(0)$ and $T^{(i,j)}(t) - t$. In the large population limit $N \to \infty$, the probability that the ancestors of $i$ and $j$ are $k$ or $l$ goes to 0 as $1/N$; thus, in the second case, the quantity $\langle (T^{(i,j)}(t) - t)\, T^{(k,l)}(0) \rangle$ is the average of the product of the coalescence times of two distinct pairs of individuals at generation 0.

As a result, the average $T_2(t)$ over the population of the coalescence times $T^{(i,j)}$ of two individuals satisfies:

$$\langle T_2(t)\, T_2(0) \rangle = \int_0^t p_2(\tau_2)\, \tau_2\, d\tau_2 \times \langle T_2(0) \rangle + e^{-t} \left( t\, \langle T_2(0) \rangle + \langle T^{(1,2)}(0)\, T^{(3,4)}(0) \rangle \right) \tag{41}$$

The coefficient $\langle T^{(1,2)}(0)\, T^{(3,4)}(0) \rangle$ can be calculated by looking at the genealogy of only four individuals. Following the appendix, the coalescence times $T^{(1,2)}(0)$ and $T^{(3,4)}(0)$ are sums of the three elementary coalescence times $\tau_2$, $\tau_3$ and $\tau_4$. These decompositions are shown in figure 6. Averaging over the tree structures and the times $\tau_i$ leads to:



$$\langle T_2(t)\, T_2(0) \rangle = 1 + \frac{2}{9}\, e^{-t} \tag{42}$$
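
The four-individual average entering (41) can be estimated by direct sampling of the coalescent of four lineages, in which any pair coalesces at rate 1 so that $k$ lineages coalesce at total rate $c_k$. The sketch below is a Monte Carlo illustration with assumed sample size, seed and function names; enumerating the trees of figure 6 gives $\langle T^{(1,2)}(0)\, T^{(3,4)}(0) \rangle = 11/9$, which is the value the estimate should approach and which, inserted in (41), gives the coefficient $2/9$ of $e^{-t}$.

```python
import random


def pair_times(rng):
    """Sample one genealogy of individuals 1..4; return the coalescence times of (1,2) and (3,4)."""
    lineages = [{1}, {2}, {3}, {4}]
    t = 0.0
    t12 = t34 = None
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)      # waiting time tau_k, rate c_k
        a, b = rng.sample(range(k), 2)             # the coalescing pair is uniform
        merged = lineages[a] | lineages[b]
        lineages = [s for idx, s in enumerate(lineages) if idx not in (a, b)]
        lineages.append(merged)
        if t12 is None and {1, 2} <= merged:
            t12 = t
        if t34 is None and {3, 4} <= merged:
            t34 = t
    return t12, t34


def estimate_product(n_samples=100000, seed=2):
    """Monte Carlo estimate of <T^(1,2) T^(3,4)>."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t12, t34 = pair_times(rng)
        total += t12 * t34
    return total / n_samples
```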

Figure 6: Genealogical trees of four individuals 1, 2, 3 and 4 and the corresponding decomposition of the coalescence times of individuals 1 and 2 on one hand, and 3 and 4 on the other hand. Up to symmetries, there are only these five types of decomposition: any other tree leads to the same type of decomposition (up to permutations of the labels or of the roles of (1,2) and (3,4)). The symmetry factors count these relabellings.



A similar calculation of the coalescence time of three individuals leads to : 

(T 3 (f)T 3 (0)> = — + — e-*-— e" 3t (43) 

More generally, the correlation functions of the coalescence times $T_m$ would be linear combinations of the exponentials $e^{-c_p t}$. The calculation of the correlation function of the $T_m$ becomes however more and more complicated with increasing $m$. We have only been able to determine the correlation function $\langle T(t) T(0) \rangle - \langle T(t) \rangle \langle T(0) \rangle$ of the coalescence time of the whole population represented in figure 2. As for $T_2$, one has to consider two cases: either the MRCA of the population at $t$ is reached between 0 and $t$, so that $T(t) < t$, or the number of ancestors at time 0 is $m \ge 2$, so that $T(t) = t + T_m(0) > t$. If $z_m(\tau)$ is the probability (12) that the number of ancestors of the population after a duration $\tau$ in the past is $m$, we have the following decomposition:

$$\langle T(t)\, T(0) \rangle = \int_0^t \tau\, z_1'(\tau)\, d\tau \times \langle T(0) \rangle + \sum_{m \ge 2} z_m(t)\, \langle (t + T_m(0))\, T(0) \rangle \tag{44}$$

where $z_1'(\tau) = p_{st}(\tau)$ is the probability density that the MRCA is reached at $\tau$.

The coefficients $\langle T_m(0)\, T(0) \rangle$ can be decomposed into a tree-dependent combination of the elementary times $\tau_i$ (see section I and the appendix) with:

$$T_m(0) = \sum_{i=q+1}^{\infty} \tau_i \quad\text{and}\quad T(0) = \sum_{i=2}^{\infty} \tau_i \tag{45}$$

where $q$ is the number of ancestors left from the whole population when the subgroup of size $m$ has just coalesced. If $a_{m,\infty}(q)$ is the probability distribution of $q$, then one has:

$$\langle T_m(0)\, T(0) \rangle = \sum_{q=1}^{\infty} a_{m,\infty}(q) \sum_{i=q+1}^{\infty} \sum_{j=2}^{\infty} \langle \tau_i \tau_j \rangle = \sum_{i=2}^{\infty} \Big[ \sum_{j=2}^{\infty} \langle \tau_i \tau_j \rangle \Big] \Big[ \sum_{q=1}^{i-1} a_{m,\infty}(q) \Big] \tag{46}$$

where the $\tau_i$ are independent random variables with the exponential distribution (1).

Let us define $a_{m,n}(q)$ as the probability that the number of ancestors of a group of size $m + n$ is $q$ at the time when the first subgroup of $m$ individuals has just coalesced into a single ancestor. Writing all the possibilities for the first coalescence of the group of size $m + n$ leads to the following recursive equation:

$$a_{m,n}(q) = \frac{c_m\, a_{m-1,n}(q) + (c_n + nm)\, a_{m,n-1}(q)}{c_{n+m}} \tag{47}$$

The boundary conditions are the probability that coalescences occur only among the first $m$ if $q = n + 1$:

$$a_{m,n}(n+1) = \prod_{i=2}^{m} \frac{c_i}{c_{i+n}} = \frac{m!\,(m-1)!\,n!\,(n+1)!}{(m+n)!\,(m+n-1)!}$$

and, for $m = 2$, the probability that the two individuals coalesce at the moment when the whole group is reduced to $q$ ancestors:

$$a_{2,n}(q) = \frac{c_2}{c_{q+1}} \prod_{j=q}^{n} \frac{c_j + 2j}{c_{j+2}} = \frac{2}{(q+2)(q+1)}\, \frac{n+3}{n+1}$$

With these boundary conditions, one gets for the solution of (47):

$$a_{m,n}(q) = \frac{m!\,n!\,q!\,(m+n-q-1)!\,(m-1)(m+n+1)}{(q+m)!\,(m+n-1)!\,(n-q+1)!} \tag{48}$$



and in the limit $n \to \infty$:

$$a_{m,\infty}(q) = \frac{q!\,m!\,(m-1)}{(m+q)!} \tag{49}$$

One can check easily that

$$\sum_{q=1}^{i-1} a_{m,\infty}(q) = 1 - \frac{i!\,m!}{(m+i-1)!} \tag{50}$$
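The recursion (47), its closed-form solution (48), the limit (49) and the partial sum (50) can be cross-checked in exact arithmetic. The following is only a minimal sketch: the function names are hypothetical, and the base cases ($a_{1,n}(q) = \delta_{q,n+1}$ and $a_{m,0}(q) = \delta_{q,1}$) are assumptions read off from the definition of $a_{m,n}(q)$ rather than formulas quoted from the text.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial


def c(k):
    """Number of pairs among k lineages, c_k = k(k-1)/2."""
    return k * (k - 1) // 2


@lru_cache(maxsize=None)
def a_rec(m, n, q):
    """a_{m,n}(q) obtained by iterating the recursion (47)."""
    if m == 1:
        # Assumed base case: a single marked individual is already "coalesced",
        # so the n others are all still present and q = n + 1.
        return Fraction(int(q == n + 1))
    if n == 0:
        # Assumed base case: no other individual, a single ancestor is left.
        return Fraction(int(q == 1))
    num = c(m) * a_rec(m - 1, n, q) + (c(n) + n * m) * a_rec(m, n - 1, q)
    return num / c(n + m)


def a_closed(m, n, q):
    """Closed form (48), for m >= 2 and 1 <= q <= n + 1."""
    num = (factorial(m) * factorial(n) * factorial(q)
           * factorial(m + n - q - 1) * (m - 1) * (m + n + 1))
    den = factorial(q + m) * factorial(m + n - 1) * factorial(n - q + 1)
    return Fraction(num, den)


def a_limit(m, q):
    """Large-n limit (49)."""
    return Fraction(factorial(q) * factorial(m) * (m - 1), factorial(m + q))
```

Working with `Fraction` keeps every comparison exact, so agreement between `a_rec` and `a_closed` is not blurred by rounding.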



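Moreover, since the $\tau_i$ are independent exponential variables of mean $1/c_i$ (see appendix A), the correlation between $\tau_i$ and $\tau_j$ is:

$$\langle \tau_i \tau_j\rangle = \frac{1}{c_i c_j}\,(1 + \delta_{ij}) \tag{51}$$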

Using (51) and (50), the permutation of the sums in (46) gives the correlation coefficients $\langle T_m(0)T(0)\rangle$:

$$\langle T_m(0)T(0)\rangle = \sum_{i=2}^{\infty} \Big[\frac{1}{c_i^2} + \frac{2}{c_i}\Big] \Big[1 - \frac{i!\,m!}{(m+i-1)!}\Big] = \sum_{i=2}^{\infty} \Big(\frac{1}{c_i^2} + \frac{2}{c_i}\Big) - 4 \sum_{i=2}^{\infty} \frac{m!\,(i-2)!}{(m+i-1)!} - \sum_{i=2}^{\infty} \frac{4\,m!\,i!}{i^2 (i-1)^2 (m+i-1)!}$$

The calculations of the first two sums give:

$$\sum_{i=2}^{\infty} \Big(\frac{1}{c_i^2} + \frac{2}{c_i}\Big) = \frac{4\pi^2}{3} - 8, \qquad \sum_{i=2}^{\infty} \frac{m!\,(i-2)!}{(m+i-1)!} = \frac{1}{m}$$

and $\langle T_m(0)T(0)\rangle$ becomes:

$$\langle T_m(0)T(0)\rangle = \frac{4\pi^2}{3} - 8 - \frac{4}{m} - \sum_{i=2}^{\infty} \frac{4\,m!\,i!}{i^2 (i-1)^2 (m+i-1)!} \tag{52}$$
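As a numerical sanity check of (52), one can compare it with the direct sum over $i$ of $(2/c_i + 1/c_i^2)\,(1 - i!\,m!/(m+i-1)!)$ that follows from (46), (50) and (51). This is only a sketch: the function names and the truncation `i_max` are assumptions, and the two truncated quantities can only be expected to agree up to a remainder of order $4/i_{\max}$.

```python
import math


def direct_sum(m, i_max=200_000):
    """Truncated sum over i of (2/c_i + 1/c_i^2) * (1 - i! m!/(m+i-1)!)."""
    total = 0.0
    ratio = 2.0 / (m + 1)            # i! m! / (m+i-1)! evaluated at i = 2
    for i in range(2, i_max + 1):
        ci = i * (i - 1) / 2.0
        total += (2.0 / ci + 1.0 / ci ** 2) * (1.0 - ratio)
        ratio *= (i + 1) / (m + i)   # update the ratio from i to i + 1
    return total


def closed_form(m, i_max=200_000):
    """Right-hand side of (52), with the remaining sum truncated at i_max."""
    total = 4.0 * math.pi ** 2 / 3.0 - 8.0 - 4.0 / m
    ratio = 2.0 / (m + 1)
    for i in range(2, i_max + 1):
        total -= 4.0 * ratio / (i ** 2 * (i - 1) ** 2)
        ratio *= (i + 1) / (m + i)
    return total
```

The factorial ratio is updated multiplicatively, which avoids evaluating large factorials at every step.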

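Finally, using the normalisation $\sum_{m=1}^{\infty} z_m(t) = 1$ and the fact that $\langle T(0)\rangle = 2$, the integration of the first term of (44) leads to:

$$\langle T(t)T(0)\rangle = 2 \int_0^t (1 - z_1(\tau))\, d\tau + \sum_{m=2}^{\infty} \langle T_m(0)T(0)\rangle\, z_m(t)$$

By multiplying (11) by $1/m$ and summing over $m$, one gets:

$$\frac{d}{d\tau} \sum_{m=2}^{\infty} \frac{z_m(\tau)}{m} = \frac{1 - z_1(\tau)}{2} - \frac{dz_1(\tau)}{d\tau} \tag{53}$$

Since the sum $\sum_{m=2}^{\infty} z_m(\tau)/m$ must vanish for large $\tau$, the solution of (53) is:

$$\sum_{m=2}^{\infty} \frac{1}{m}\, z_m(t) = (1 - z_1(t)) - \frac{1}{2} \int_t^{\infty} (1 - z_1(\tau))\, d\tau \tag{54}$$

Using (52) and (54), one gets:

$$\langle T(t)T(0)\rangle = 4 + \Big(\frac{4\pi^2}{3} - 12\Big) (1 - z_1(t)) - \sum_{m=2}^{\infty} \sum_{i=2}^{\infty} \frac{4\,i!\,m!}{i^2 (i-1)^2 (m+i-1)!}\, z_m(t)$$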



If one collects the exponential terms $e^{-c_p t}$ using (12), the correlation function takes the following form:

$$\langle T(t)T(0)\rangle - \langle T(t)\rangle\langle T(0)\rangle = \sum_{p=2}^{\infty} (-1)^p (2p-1)\, A_p\, e^{-c_p t} \tag{55}$$

with coefficients $A_p$ given by

$$A_p = \frac{4\pi^2}{3} - 12 - \sum_{i=2}^{\infty} \sum_{m=2}^{p} \frac{4\,i!\,(-1)^m (m+p-2)!}{i^2 (i-1)^2 (m+i-1)!\,(m-1)!\,(p-m)!} \tag{56}$$

One can show that the sum over $m$ is given by:

$$\sum_{m=2}^{p} \frac{(m+p-2)!\,(-1)^m}{(m+i-1)!\,(m-1)!\,(p-m)!} = \frac{1}{i!} - \begin{cases} 0 & \text{for } p > i \\ \dfrac{(i-1)!}{(p+i-1)!\,(i-p)!} & \text{for } i \ge p \end{cases}$$

This identity gives finally the coefficient $A_p$:

$$A_p = 4 \sum_{i=p}^{\infty} \frac{(i-2)!^2}{i\,(i+p-1)!\,(i-p)!} \tag{57}$$

One can notice that $A_p \to 0$ when $p \to \infty$. The correlation functions $\langle T_2(t)T_2(0)\rangle - \langle T_2(t)\rangle\langle T_2(0)\rangle$, $\langle T_3(t)T_3(0)\rangle - \langle T_3(t)\rangle\langle T_3(0)\rangle$ and $\langle T(t)T(0)\rangle - \langle T(t)\rangle\langle T(0)\rangle$ are shown in figure 7.

By a calculation not shown here, one can check that expressions (55, 57) coincide with the one obtained from (24) and this confirms the validity of the Markov process defined in (16) and (17).
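The series (55) with the coefficients (57) is straightforward to evaluate numerically. The sketch below is not the authors' code: the function names and the cutoffs `i_max` and `p_max` are assumptions, and logarithms of factorials are used only to avoid overflow. At $t = 0$ the series should reproduce the variance of $T$, that is $\sum_{i \ge 2} 1/c_i^2 = 4\pi^2/3 - 12$.

```python
import math


def A(p, i_max=5000):
    """Coefficient A_p of (57), with the sum over i truncated at i_max."""
    total = 0.0
    for i in range(p, i_max + 1):
        # log of (i-2)!^2 / (i (i+p-1)! (i-p)!), using lgamma(k + 1) = log(k!)
        log_term = (2.0 * math.lgamma(i - 1) - math.log(i)
                    - math.lgamma(i + p) - math.lgamma(i - p + 1))
        total += math.exp(log_term)
    return 4.0 * total


def correlation(t, p_max=30):
    """Connected correlation <T(t)T(0)> - <T>^2 from (55), truncated at p_max."""
    total = 0.0
    for p in range(2, p_max + 1):
        c_p = p * (p - 1) / 2.0
        total += (-1) ** p * (2 * p - 1) * A(p) * math.exp(-c_p * t)
    return total
```

For instance, `correlation(0.0)` can be compared with `4 * math.pi**2 / 3 - 12`, and `correlation(t)` for increasing `t` shows the decay dominated by the $p = 2$ term $e^{-t}$.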



III. CORRELATION BETWEEN THE COALESCENCE TIME AND THE GENOMIC DIVERSITY 



So far we only considered the statistical properties of the coalescence times along the tree. We are going to study now how these times are correlated to the genetic diversity.

Figure 7: Correlation functions of the average over the population $T_2 = \frac{1}{N^2} \sum_{i,j} T^{(i,j)}$ of the age of the MRCA of $n = 2$ individuals (dashed), of the average $T_3 = \frac{1}{N^3} \sum_{i,j,k} T^{(i,j,k)}$ of the age of the MRCA of $n = 3$ individuals (thin) and of the age $T(t)$ of the MRCA of the whole population (thick).



The genetic diversity can be measured by different quantities according to the model one considers (see for example Tajima's estimator for the infinite site model [14]). We consider here the case of an infinite number of alleles: any mutation creates a new allele which has never occurred before. Thus, for two individuals chosen at random in the population, there are only two possibilities: either they have the same allele or they have different ones. Now we want to calculate the average age of the MRCA, conditioned on the fact that the two individuals chosen at random have (or do not have) the same genome.

More generally, the population is divided into groups of individuals sharing the same genome, whose sizes charac- 
terize the genetic diversity of the population. The determination of the distribution of the age of the MRCA, given the 
size of these groups, is a difficult problem that we could not solve. Here we address a simpler version of this problem : 
suppose we have some information about the genes of a few individuals chosen at random in the population; what 
can be said about the age of the MRCA ? 

In the present case, we consider a group of $n \ll N$ individuals and we suppose that the first $m$ of them have identical genomes. Of course, the $n - m$ others may have the same genome or different ones: we suppose that we have no information about them. Knowing this partial information about the present generation, we look at the coalescence time of the whole group of $n$ individuals.

We first look at the probability distribution $p_{m,n}(T_n)$ of observing a group of size $n$ whose coalescence time is equal to $T_n$ and in which the first $m$ individuals have the same genome. The coalescence time $T_n$ of such a group of size $n$ is the coalescence time of their parents at the previous generation plus one generation. The group of the parents is a group of size $n' \le n$. At first order in $1/N$, the only possible events which may occur are a coalescence ($n' = n - 1$) or a mutation ($n' = n$). The probability of a coalescence among the first $m$ individuals is $c_m/N = m(m-1)/2N$; in this case, the probability distribution of the coalescence time of the parents is $p_{m-1,n-1}$. For other coalescences (probability $(c_n - c_m)/N$), it is $p_{m,n-1}$. Moreover, no mutation must affect the first $m$ individuals, which happens with probability $1 - m\theta/N$ per generation at this order. Consequently, the probability distribution $p_{m,n}(T)$ satisfies the following recursive equation:

$$\frac{d}{dT}\, p_{m,n}(T) = c_m\, p_{m-1,n-1}(T) + (c_n - c_m)\, p_{m,n-1}(T) - (c_n + m\theta)\, p_{m,n}(T) \tag{58}$$

where the $c_n$ are the binomial coefficients (3).

For $m = 1$, the distribution $p_{1,n}$ is just the stationary distribution of $T_n$ related to (2). For $n = m$, $p_{m,m}$ is the distribution of the coalescence time of a group of $m$ individuals with the same genome. Its Laplace transform is:

$$\hat p_{m,m}(s) = \int_0^{\infty} p_{m,m}(t)\, e^{-st}\, dt = \prod_{i=2}^{m} \frac{c_i}{s + c_i + i\theta} \tag{59}$$
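In Laplace space, (58) becomes an algebraic recursion that is easy to iterate. The sketch below assumes that $p_{m,n}(T = 0) = 0$, so that the transform of the derivative is simply $s\,\hat p_{m,n}(s)$, and that the $m = 1$ case is the unconstrained product $\prod_{i=2}^{n} c_i/(s + c_i)$; the function names are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


def c(k):
    """c_k = k(k-1)/2 as an exact rational."""
    return Fraction(k * (k - 1), 2)


def p_hat(m, n, s, theta):
    """Laplace transform of p_{m,n}, obtained by transforming (58)."""

    @lru_cache(maxsize=None)
    def rec(mm, nn):
        if mm == 1:
            # Assumed base case: no constraint on the genomes.
            out = Fraction(1)
            for i in range(2, nn + 1):
                out *= c(i) / (s + c(i))
            return out
        if nn < mm:
            return Fraction(0)
        num = c(mm) * rec(mm - 1, nn - 1) + (c(nn) - c(mm)) * rec(mm, nn - 1)
        return num / (s + c(nn) + mm * theta)

    return rec(m, n)
```

With exact rationals for `s` and `theta`, `p_hat(m, m, s, theta)` can be compared term by term with the product in (59), and `p_hat(2, n, 0, theta)` gives the probability that two individuals share the same genome, which does not depend on `n`.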
The general solution of (58), which we will give below in (67), is difficult to handle in general. Let us consider first the simple case $m = 2$ and define the parameter $Y$ related to the genomic diversity as:

$$Y = \frac{1}{N(N-1)} \sum_{i \ne j} \delta_{g(i),g(j)} \tag{60}$$

where $g(i)$ is the genome of the individual $i$. $Y$ does not count the number of differences between two sequences (as does Tajima's estimator [14]), since we do not suppose any information about the structure of the genome; it just detects whether at least one mutation has occurred or not, and can be interpreted as the fraction of pairs of individuals having the same genome. When $Y$ is close to 1, the population is very homogeneous and all the individuals have very similar genomes, whereas $Y$ close to 0 corresponds to a population where the genetic diversity is very large. From the definition of $p_{m,n}$, one gets:

$$\hat p_{2,\infty}(s) = \langle Y e^{-sT}\rangle \tag{61}$$

where $\hat p_{2,\infty}(s)$ is the limit for large $n$ of the generating functions $\hat p_{2,n}(s)$ that satisfy a recursion directly deduced from (58):

$$\hat p_{2,n}(s) = \frac{1}{s + c_n + 2\theta} \Big(\hat p_{n-1}(s) + (c_n - 1)\, \hat p_{2,n-1}(s)\Big) \tag{62}$$

where $\hat p_n(s) = \langle e^{-sT_n}\rangle$ is the generating function with no information (3). The solution of (62) (which is a particular case of the general solution (67) given below) is:

$$\hat p_{2,\infty}(s) = \langle Y e^{-sT}\rangle = \sum_{q=1}^{\infty} \frac{2}{(q+1)(q+2)} \prod_{i=2}^{q} \frac{c_i}{s + c_i} \prod_{j=q+1}^{\infty} \frac{c_j}{s + c_j + 2\theta} \tag{63}$$



It allows one to determine the distribution of the coalescence time of the whole population, conditioned on the fact that two individuals chosen at random have the same genome. Moreover, successive derivatives of (63) at $s = 0$ give all the correlation coefficients $\langle Y T^k\rangle$. These coefficients measure how well $Y$ is adapted, as an estimator, to the determination of the age $T$ of the MRCA. The following computation focuses on the properties of the average coalescence time $\langle T \,|\, 2\ \mathrm{id.}\rangle$ knowing that two individuals chosen at random have the same genome.
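For a concrete sample, the parameter $Y$ of (60) is simply the fraction of ordered pairs of distinct individuals that share the same genome. A minimal sketch follows; the function name and the representation of genomes as hashable labels are assumptions made for illustration.

```python
from collections import Counter


def pair_identity_fraction(genomes):
    """Y of (60): fraction of ordered pairs (i, j), i != j, with g(i) == g(j)."""
    n = len(genomes)
    if n < 2:
        raise ValueError("need at least two individuals")
    # A family of k identical genomes contributes k (k - 1) ordered pairs.
    same = sum(k * (k - 1) for k in Counter(genomes).values())
    return same / (n * (n - 1))
```

Grouping identical genomes with `Counter` avoids the explicit double sum over pairs: for the sample `["a", "a", "b", "b"]` there are $2 + 2$ identical ordered pairs out of $12$.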

The average coalescence time $T_n$ of $n$ individuals conditioned on the fact that two individuals chosen at random among these $n$ have the same genome can also be obtained from (58). The Laplace transform $\hat p_{m,n}(s) = \int e^{-sT} p_{m,n}(T)\, dT$ for $s = 0$ gives the probability that the first $m$ individuals of the group of size $n$ have the same genome (see (59)). Thus the normalized quantity $\hat p_{m,n}(s)/\hat p_{m,n}(0)$ is the generating function of the coalescence time of $n$ individuals conditioned on the fact that $m$ individuals chosen at random among them have the same genome. For $m = 2$, one has $\hat p_{2,n}(0) = 1/(1 + 2\theta)$. The average conditioned time is minus the derivative of $\hat p_{m,n}(s)/\hat p_{m,n}(0)$ at $s = 0$:

$$u_n(\theta) = \langle T_n \,|\, 2\ \mathrm{id.}\rangle = -(1 + 2\theta)\, \frac{d}{ds}\, \hat p_{2,n}(s) \Big|_{s=0}$$

By taking the derivative of (62) one gets:

$$u_n(\theta) = \frac{1}{c_n + 2\theta} \Big(1 + 2(1 + 2\theta)\, \frac{n-2}{n-1} + (c_n - 1)\, u_{n-1}(\theta)\Big) \tag{64}$$

The initial condition is given by the coalescence time of 2 individuals with the same genome:

$$u_2(\theta) = \frac{1}{1 + 2\theta}$$



The general solution of (64) is given by

$$u_n(\theta) = \frac{2(n-2)}{n-1} + \frac{1}{1 + 2\theta} + \sum_{p=3}^{n} \frac{(-1)^p}{c_p + 2\theta}\, \frac{(n+1)!\,(n-2)!}{(n+p-1)!\,(n-p)!}\, \frac{(2p-1)(p+1)(p-2)}{2}$$
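The recursion (64) and its general solution can be compared in exact arithmetic. This is only a sketch with hypothetical function names; it assumes the initial condition $u_2(\theta) = 1/(1 + 2\theta)$ and the alternating sign $(-1)^p$ in the sum over $p$.

```python
from fractions import Fraction
from math import factorial


def c(k):
    """c_k = k(k-1)/2 as an exact rational."""
    return Fraction(k * (k - 1), 2)


def u_recursive(n, theta):
    """u_n(theta) obtained by iterating the recursion (64) from u_2."""
    theta = Fraction(theta)
    u = 1 / (1 + 2 * theta)
    for k in range(3, n + 1):
        source = 1 + 2 * (1 + 2 * theta) * Fraction(k - 2, k - 1)
        u = (source + (c(k) - 1) * u) / (c(k) + 2 * theta)
    return u


def u_closed(n, theta):
    """General solution of (64) written as a finite sum over p."""
    theta = Fraction(theta)
    total = Fraction(2 * (n - 2), n - 1) + 1 / (1 + 2 * theta)
    for p in range(3, n + 1):
        weight = Fraction(factorial(n + 1) * factorial(n - 2),
                          factorial(n + p - 1) * factorial(n - p))
        poly = Fraction((2 * p - 1) * (p + 1) * (p - 2), 2)
        total += (-1) ** p * weight * poly / (c(p) + 2 * theta)
    return total
```

At $\theta = 0$ both functions reduce to the unconditioned average $2(n-1)/n$, and for $\theta > 0$ the recursion can be iterated to large $n$ to approach the conditioned average of the whole population.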

If $\theta = 0$, all the individuals have the same genome and the value of $u_n(\theta)$ for $\theta = 0$ is just the unconditioned average $2(n-1)/n$. The large $n$ limit of this solution (performed by considering $u_n(\theta) - u_n(0)$ to regularize the series) leads for the average coalescence time $\langle T \,|\, 2\ \mathrm{id.}\rangle$ of a whole population conditioned on the fact that two individuals chosen at random have identical genomes to:

$$\langle T \,|\, 2\ \mathrm{id.}\rangle = 2 - \frac{2\theta}{1 + 2\theta} - \theta \sum_{p=3}^{\infty} (-1)^p\, \frac{(2p-1)(p+1)(p-2)}{c_p\, (c_p + 2\theta)}$$

The $\theta$ dependence of this average coalescence time is shown in figure 8. Although $Y$ is a rough estimator of the genetic diversity and we consider only information about two individuals, $\langle T \,|\, 2\ \mathrm{id.}\rangle$ is shifted by up to 5% compared to the case of no information.

Figure 8: Average coalescence time of a whole population of large size knowing that two individuals chosen at random have the same genome (left) or different genomes (right). Without conditioning on the genomes of the two individuals, the average coalescence time would be $\langle T\rangle = 2$.

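One can write down a general expression for the Laplace transform $\hat p_{m,n}(s) = \int_0^{\infty} p_{m,n}(t)\, e^{-st}\, dt$ of the solution of (58):

$$\hat p_{m,n}(s) = \sum_{1 \le n_1 < \dots < n_m = n} \frac{B_{n,m}(\{n_j\})}{S(n)} \times \prod_{i=2}^{n} f_i(\{n_j\}, s) \tag{67}$$

with functions $f_i(\{n_j\}, s)$ defined as

$$f_i(\{n_j\}, s) = \begin{cases} \dfrac{c_i}{s + c_i + j\theta} & \text{for } n_j \ge i \ge n_{j-1} + 1 \text{ and } j > 1, \\ \dfrac{c_i}{s + c_i} & \text{for } n_1 \ge i \end{cases} \tag{68}$$

and amplitudes:

$$B_{n,m}(\{n_j\}) = \frac{S(m)\,(n+m-1)!\,(n-m)!}{2^{n-m}} \prod_{j=1}^{m-1} \frac{1}{n_j + j}\, \frac{1}{n_j + j + 1}$$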

This result can be obtained by counting trees and averaging on coalescence times $\tau_i$ as shown in appendix A. Let us sketch briefly the derivation of (67). The genealogy of the group of $n$ individuals can be divided into several parts which correspond to a constant number of ancestors of the subgroup of $m$ individuals, i.e. the parts are separated by coalescences among the ancestors of the $m$ individuals. The indices $n_j$ in (67) are the number of ancestors of the $n$ individuals at the times of these coalescences, i.e. when the number of ancestors of the $m$ individuals decreases from $j + 1$ to $j$ due to a coalescence. The quantity $B_{n,m}(\{n_j\})$ counts the number of trees sharing the same parameters $n_j$ and thus the sum over the $n_j$ in (67) is an average over the shape of the trees. The value of $B_{n,m}(\{n_j\})$ can be obtained by counting at each coalescence the number of possibilities compatible with the value $n_j$.

Figure 9: Probability distribution of the coalescence time of a group of individuals knowing that the first $m$ of them have the same genome. Solid line: stationary distribution for a large group without information. Dashed (long): $\theta = 0.5$ and $m = 2$ in a large group. Dashed (short): $\theta = 0.5$ and $m = 5$ (numerical simulations for a population of 50 individuals).

Given a set of parameters $n_j$, we now consider the distribution of the coalescence times $\tau_i$ conditioned by the shape of the tree and the genomes of the subgroup of $m$ individuals. Mutations are forbidden in the subtree of the $m$ individuals. Thus, if the number of ancestors of the $m$ individuals is $j$ during $\tau_i$, the probability that no mutation occurs is $e^{-j\theta\tau_i}$. If one introduces the parameters $n_j$, the probability that the delay between the $(i-1)$-th coalescence and the $i$-th is $t$ and that no mutations occur on the lineages of the $m$ individuals with the same genome is $f_i(\{n_j\}, t)$ defined as:

$$f_i(\{n_j\}, t) = \begin{cases} c_i\, e^{-(c_i + j\theta)t} & \text{for } n_j \ge i \ge n_{j-1} + 1 \text{ and } j > 1, \\ c_i\, e^{-c_i t} & \text{for } n_1 \ge i. \end{cases} \tag{69}$$

The Laplace transform of these expressions gives the result (68) and the product of the $f_i$ in (67) corresponds to the average on the $\tau_i$'s.

Figure 9 shows the distribution $p_{2,\infty}(t)/\hat p_{2,\infty}(0)$ of the conditioned coalescence time $T$ obtained from (63). It also shows numerical results on a population of 50 individuals, which agree with the analytical calculations and show how information about five individuals significantly modifies the coalescence time of the whole population.

IV. CONCLUSION

In the present paper, we have shown that the evolution of all the coalescence times at the top of the genealogical tree can be described by a Markov process (section II C). This Markov process allowed us to calculate various properties (for instance (24), (27) and (37)) of the age of the MRCA, in particular its autocorrelation function (55, 57). We have also shown how to calculate the correlation between the age of the MRCA and a parameter representing the genetic diversity (section III). Our general formula (67), which gives the distribution of the age of the MRCA of $n$ individuals knowing that a sample of $m$ of them chosen at random have the same allele, is not easy to manipulate. Its interpretation as a weighted sum over a large number of tree configurations may however allow numerical simulations with Monte-Carlo methods by sampling efficiently the terms of the sum.

The Markov property of the genealogies is the most promising result of this paper and one may hope to construct more general Markov processes of this type. A first direction would be to try to incorporate the genetic diversity in the Markov process: whereas section III leads only to the stationary correlation coefficients $\langle Y T^k\rangle$, the construction of a joint Markov process for the times $\tau_i$ and the sizes of the families may lead to correlations at different times and establish links between extinctions and variations of the genetic diversity. Moreover this could be related to existing works on the case where sampling the DNA of individuals at different times is possible.

Extensions of the Markov process to more realistic models would also be interesting but many aspects of the calculations may differ. For example, the shape of the genealogical trees changes in the presence of selection, since multiple coalescences [2, 25, 26] have to be included and this should change the weights of the trees and the probabilities of extinctions of families. The study of structured populations shows that demographic and geographic effects are important: it would be interesting to know if the Markov property of the coalescence times persists, up to changes in the transition rates. Diploidy [12, 13] is more problematic since it has more radical effects (e.g. the age of the MRCA scales as $\log N$ and not as $N$ anymore) because genealogical trees have a more complicated structure with loops.

Lastly, it would be interesting to see how more detailed information about the genomes could lead to a more accurate estimation of the age of the MRCA. The analysis of section III deals with only one gene. Distinct genes may evolve in different ways since the MRCA and, in the present generation, one is left with different parameters $Y$ for each gene. Information about the genetic diversities for different genes would modify the distribution of the times $\tau_i$ in order to account for possible differences in the number of mutations of each gene. Moreover, in real cases, the observation of different genes along a DNA sequence would be incomplete if recombination [23] is not taken into account. Recombination acts as if the two genes of a given individual are not inherited from the same parent. It implies that the genealogical trees of the two genes will have some different branches and that the MRCA may be different for the two genes; the difference of ages between these ancestors may be worth further investigation.

Appendix A: MEASURE ON GENEALOGIES

In this appendix we recall briefly the derivation of the statistics of the coalescence times in the genealogies of a set of individuals. The problem can be divided into two aspects: the distribution of the coalescence times and the shape of the tree.

We consider a group of n individuals undergoing coalescences until they reach their MRCA. Each coalescence is 
characterized by two quantities : the waiting time until it occurs and the pair of individuals which coalesce. 

For a large size $N$ of the population, coalescences occur one after another. At each generation, the probability of a coalescence between a given pair of individuals is $1/N$ and the total probability of observing a coalescence is $c_n/N$ for a group of $n$ individuals, where the coefficient $c_n$ is defined as $c_n = n(n-1)/2$ (see (3)). The probability of observing the first coalescence at generation $G$ in the past is then $p_n(G) = (c_n/N)(1 - c_n/N)^G$, which becomes for the rescaled time $\tau = G/N$:

$$p_n(\tau) = c_n\, e^{-c_n \tau} \tag{A1}$$

After this coalescence, we are left with $n - 1$ individuals and the rescaled time $\tau_{n-1}$ before the next coalescence is then distributed according to $p_{n-1}(\tau_{n-1})$, and so on. So, the distribution of the $(n-1)$ waiting times between two successive coalescences for a group of $n$ individuals is:

$$P_n(\tau_n, \dots, \tau_2) = \prod_{i=2}^{n} c_i\, e^{-c_i \tau_i} \tag{A2}$$

Consequently, the total coalescence time can be written as a sum of $n - 1$ independent variables:

$$T_n = \sum_{i=2}^{n} \tau_i \tag{A3}$$
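The decomposition of the coalescence time into independent exponential waiting times lends itself to direct sampling. The sketch below (hypothetical function names, arbitrary seed chosen for reproducibility) draws $T_n$ as a sum of exponential variables of rates $c_i = i(i-1)/2$ and compares the empirical mean with the exact value $\sum_{i=2}^{n} 1/c_i = 2(1 - 1/n)$.

```python
import random
from fractions import Fraction


def mean_coalescence_time(n):
    """Exact mean of T_n: sum of 1/c_i for i = 2..n."""
    return sum(Fraction(2, i * (i - 1)) for i in range(2, n + 1))


def sample_coalescence_time(n, rng):
    """One draw of T_n = tau_2 + ... + tau_n, tau_i exponential of rate c_i."""
    return sum(rng.expovariate(i * (i - 1) / 2) for i in range(2, n + 1))


def empirical_mean(n, samples=20_000, seed=0):
    """Monte-Carlo estimate of <T_n> with a fixed seed."""
    rng = random.Random(seed)
    return sum(sample_coalescence_time(n, rng) for _ in range(samples)) / samples
```

For $n = 10$ the exact mean is $9/5$, and the empirical mean should agree with it within the statistical error, which is of order $10^{-2}$ for $20\,000$ samples.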

Once the dates of the coalescences are known, we have to decide which branches coalesce at each step. We will consider here that a tree $\mathcal{T}$ is completely characterized by its topology and the chronological order of these $n - 1$ coalescences. With this definition, which is convenient for our calculations, the two trees shown in figure 10 are distinct.

Figure 10: Two genealogical trees. Their topologies are identical but the chronology is different.

The total number of such ordered trees is thus

$$S(n) = \prod_{i=2}^{n} c_i \tag{A4}$$

and they are all equally likely. The probability measure $\mu_n$ of a given genealogy factorizes as:

$$\mu_n(\mathcal{T}, \{\tau_i\}) = \frac{1}{S(n)}\, p_n(\tau_n)\, p_{n-1}(\tau_{n-1}) \cdots p_2(\tau_2) \tag{A5}$$



For a given tree, one can determine from (A5), for each ancestor on a branch of the tree, the distribution of its number of descendants in the present generation. For example, right before the last coalescence, the ancestors of the group of size $n$ consist of two parents who have in the present generation $p$ and $n - p$ descendants respectively. The distribution of the sizes $p$ and $n - p$ of these two groups can be obtained by counting the number $s(n, p)$ of trees satisfying this constraint. The probability $p_n(p)$ of observing the subdivision $(p, n - p)$ with $1 \le p \le n - 1$ is given by:

$$p_n(p) = \frac{s(n,p)}{S(n)} = \frac{1}{2\,S(n)}\, S(p)\, S(n-p) \binom{n}{p} \binom{n-2}{p-1} \tag{A6}$$

The binomial coefficient $\binom{n}{p}$ counts the number of ways of making the groups of $p$ and $n - p$ individuals, the coefficients $S(p)$ and $S(n - p)$ count the number of subtrees for each group, the factor $\binom{n-2}{p-1}$ counts the ways of organizing the chronological order between the coalescences of the two subtrees, and the factor $1/2$ accounts for the fact that either of the two parents can be the one with $p$ descendants. The dependence on $p$ disappears in (A6) and $p_n(p)$ is the uniform distribution:

$$p_n(p) = \frac{1}{n-1} \tag{A7}$$
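The cancellation of the $p$ dependence can be verified directly in exact arithmetic. The sketch below assumes that the number of ordered trees is $S(n) = \prod_{i=2}^{n} c_i = n!\,(n-1)!/2^{n-1}$ and that the counting formula carries an overall factor $1/2$; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def S(n):
    """Number of ordered trees, S(n) = prod_{i=2}^n c_i = n! (n-1)! / 2^(n-1)."""
    return Fraction(factorial(n) * factorial(n - 1), 2 ** (n - 1))


def split_probability(n, p):
    """Probability that the two oldest branches carry p and n - p descendants."""
    count = Fraction(1, 2) * S(p) * S(n - p) * comb(n, p) * comb(n - 2, p - 1)
    return count / S(n)
```

For every $n$, `split_probability(n, p)` is the same for all $1 \le p \le n - 1$ and the values sum to one.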

One should notice that this result is obtained for a large population $N$ and a group of size $n \ll N$, such that coalescences occur only between pairs of individuals and not more. However, if $n$ is large enough and if we define the density $x = p/n$, the corresponding distribution $p(x)$ is uniform on $[0, 1]$.

For a branch of length $\tau$, the number $m$ of mutations has a Poisson distribution:

$$P(\tau, m) = \frac{(\theta\tau)^m}{m!}\, e^{-\theta\tau}$$

So the probability of observing no mutation on this branch (which is the only relevant quantity in the infinite allele case) is given by:

$$P_{\rm no\,mut}(\tau) = e^{-\theta\tau} \tag{A8}$$

Appendix B: DYNAMICS OF THE TIMES $\tau_i$

The main text shows the stochastic dynamics of a coalescence time. Actually, all the elementary coalescence times $\tau_i$ defined in appendix A have similar dynamics: either they increase by $\tau_{i+1}$ or they are reset to $\tau_{i+1}$. The idea of a Markov process in genealogies is not new and some features are presented in [2].

If one considers a generic tree (as in the figure defining the times $\tau_i$) truncated below $\tau_n$, one sees that the times $\tau_i$ topple when some lineages coming from the $n$ ancestors at the "leaves" of the truncated tree disappear. Let us assume that the lineage of a given ancestor among these $n$ disappears and that this ancestor is directly connected to the $j$-th coalescence, i.e. the coalescence separating $\tau_j$ and $\tau_{j+1}$. For example, if $j = 1$, the ancestor is directly connected to the MRCA and if $j = 2$, it is directly connected to $A_1$ in the same figure. If the lineage of this ancestor in the present generation disappears, the times $\tau_i$ topple at rank $j + 1$, i.e. they are redefined as:

$$\begin{cases} \tau_i' = \tau_i & \text{for } i < j \\ \tau_j' = \tau_j + \tau_{j+1} \\ \tau_i' = \tau_{i+1} & \text{for } i > j \end{cases}$$
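The toppling rule amounts to merging two consecutive times and shifting the following ones by one rank. A minimal sketch on a finite list of times, with a hypothetical function name and ranks counted from 1 as in the text:

```python
def topple(times, j):
    """Apply the toppling rule to the list [tau_1, tau_2, ...].

    tau_j absorbs tau_{j+1}, earlier times are unchanged and later times
    are shifted down by one rank, so the list loses one element.
    """
    if not 1 <= j < len(times):
        raise ValueError("rank j must satisfy 1 <= j < len(times)")
    k = j - 1  # 0-based position of tau_j
    return times[:k] + [times[k] + times[k + 1]] + times[k + 2:]
```

For example, `topple([1.0, 2.0, 3.0, 4.0], 2)` returns `[1.0, 5.0, 4.0]`. The sum of the times is conserved by the move.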
Let us call $P_{\rm node}(n, j)$ the probability that a given ancestor among the $n$ is directly connected to the node of the $j$-th coalescence, i.e. the lineage of this ancestor does not participate in any coalescence until the number of ancestors reaches $j$. With these notations, the probability $p_j\, dt$ defined in (18) that the times topple at rank $j$ is the probability that the lineage which disappears during $dt$ (probability $a_n\, dt$) is the lineage connected to the $j$-th coalescence and thus it is given by:

$$p_j\, dt = P_{\rm node}(n, j)\, a_n\, dt \tag{B1}$$

The value of $a_n$ can be derived by introducing the probability $Q_t(n, t_0)$ that the number of ancestors at time $t_0 < t$ of the whole population at time $t$ is $n$. These $n$ ancestors at time $t_0$ generate all the population at time $t$, which can be divided into $n$ groups according to the ancestor they come from. At time $t + dt$, either one of these groups becomes extinct (probability $a_n\, dt$) and the number of ancestors at $t_0$ of the population at $t + dt$ is $n - 1$, or this number is still equal to $n$. The probabilities $Q_t(n, t_0)$ satisfy the following equations:

$$Q_{t+dt}(n-1, t_0) = Q_t(n, t_0)\, a_n\, dt + Q_t(n-1, t_0)\, (1 - a_{n-1}\, dt)$$

It gives the differential equation:

$$\frac{d}{dt}\, Q_t(n-1, t_0) = Q_t(n, t_0)\, a_n - Q_t(n-1, t_0)\, a_{n-1} \tag{B2}$$

In the stationary regime, this probability is $Q_t(n, t - \tau) = z_n(\tau)$ where the $z_n$'s have been defined in section II C and satisfy (11). Comparing (B2) and (11) leads to:

$$a_n = c_n \tag{B3}$$

The probability $P_{\rm node}(n, j)$ is the probability that, in the genealogy of a group of $n$ individuals, the lineage of a given individual among the $n$ is directly connected to the node of the $j$-th coalescence, i.e. that coalescences do not involve its lineage until the number of ancestors of the group is reduced to $j + 1$. Counting the number of possibilities for each coalescence gives:

$$P_{\rm node}(n, j) = \frac{j}{c_{j+1}} \prod_{k=j+2}^{n} \frac{c_k - (k-1)}{c_k} = \frac{j}{c_n} \tag{B4}$$

Putting (B4) and (B3) in (B1) gives the toppling rates presented in (18):

$$p_j = \frac{j}{c_n} \times c_n = j \tag{B5}$$
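The product of coalescence probabilities behind the toppling rates can be evaluated step by step and compared with its closed form $j/c_n$. The following is a sketch with hypothetical function names; it assumes that the given lineage is spared with probability $(c_k - (k-1))/c_k$ while $k$ lineages remain, and is hit with probability $j/c_{j+1}$ at the coalescence leading from $j + 1$ to $j$ lineages.

```python
from fractions import Fraction


def c(k):
    """c_k = k(k-1)/2 as an exact rational."""
    return Fraction(k * (k - 1), 2)


def p_node(n, j):
    """Probability that a given lineage among n joins the tree at the j-th node."""
    prob = Fraction(j) / c(j + 1)            # the lineage takes part in the step j+1 -> j
    for k in range(j + 2, n + 1):
        prob *= (c(k) - (k - 1)) / c(k)      # the lineage is spared while k lineages remain
    return prob


def toppling_rate(n, j):
    """Rate p_j = P_node(n, j) * a_n with a_n = c_n."""
    return p_node(n, j) * c(n)
```

The rates returned by `toppling_rate` do not depend on `n`, and the probabilities `p_node(n, j)` sum to one over $1 \le j \le n - 1$.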